zap: TinyZAP for multi-uint64 entries. #18568

Open
akashb-22 wants to merge 1 commit into openzfs:master from akashb-22:tinyzap_blob2

@akashb-22
Contributor

@akashb-22 akashb-22 commented May 20, 2026

Introduce TinyZAP, a new on-disk ZAP format between MicroZAP and FatZAP. MicroZAP is limited to 1×uint64 values and 49-char keys; any wider entry forces a full FatZAP upgrade. TinyZAP avoids this for the common case of multi-integer values (e.g., Lustre FIDs) and long key names.

Signed-off-by: Akash B <akash-b@hpe.com>

Motivation and Context

This PR introduces TinyZAP, a new on-disk ZAP format that sits between MicroZAP and FatZAP in the ZAP format hierarchy. TinyZAP extends MicroZAP to handle multi-word values and long key names efficiently, without the overhead of a full FatZAP upgrade.
The primary motivation is workloads like Lustre that store multi-integer values (e.g., FIDs: 2-3 x uint64_t) or long filenames in ZAP objects. Previously, these always created a FatZAP, consuming significantly more on-disk space and memory than necessary.

ZAP Format Hierarchy (After This Change)
MicroZAP -> TinyZAP -> FatZAP
Internally, however, TinyZAP is still a MicroZAP in an extended format: an in-place extension that supports multi-integer values and longer key names without upgrading to FatZAP.

Description

TinyZAP reuses the existing mzap_phys_t block format.
Three previously reserved bytes in the 64-byte header are repurposed as independent uint8_t fields:

typedef struct mzap_phys {
    uint64_t mz_block_type;   /* ZBT_MICRO */
    uint64_t mz_salt;
    uint64_t mz_normflags;
    uint8_t  mz_flags;        /* MZAP_FLAG_TINY Flag to distinguish from microzap */
    uint8_t  mz_chunk_shift;  /* log2(chunk): 6=64B, 7=128B, 8=256B */
    uint8_t  mz_value_ints;   /* num_integers; stride = mz_value_ints * 8 */
    uint8_t  mz_pad1;         /* zero */
    uint32_t mz_pad2;         /* zero */
    uint64_t mz_pad3[4];      /* zero */
    mzap_ent_phys_t mz_chunk[]; /* variable-size chunk array */
} mzap_phys_t;

When mz_flags == 0, the block is a plain MicroZAP. When MZAP_FLAG_TINY (bit 0) is set, the TinyZAP layout applies. Each chunk slot is a tzap_ent_phys_t:

[0 .. stride-1]       value blob  (mz_value_ints × uint64_t)
[stride .. stride+3]  cd          (uint32_t)
[stride+4 .. chunk-1] name        (NUL-terminated string)
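
As an illustration of the slot layout above, here is a minimal sketch of how one entry could be written into a chunk. The helper names (`tz_cd_off`, `tz_name_off`, `tz_pack`) are hypothetical and are not the functions in this patch; only the offsets are taken from the layout. Values are stored in host byte order for simplicity.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers; only the offsets come from the layout above. */
static inline size_t
tz_cd_off(unsigned value_ints)
{
	return ((size_t)value_ints * 8);	/* cd follows the value blob */
}

static inline size_t
tz_name_off(unsigned value_ints)
{
	return (tz_cd_off(value_ints) + sizeof (uint32_t));
}

/* Fill one chunk slot; returns 0 on success, -1 if the name does not fit. */
static inline int
tz_pack(uint8_t *slot, size_t chunk, unsigned value_ints,
    const uint64_t *vals, uint32_t cd, const char *name)
{
	size_t name_off = tz_name_off(value_ints);

	assert(value_ints > 0 && name_off < chunk);
	if (strlen(name) >= chunk - name_off)
		return (-1);	/* no room for the terminating NUL */

	memset(slot, 0, chunk);
	memcpy(slot, vals, tz_cd_off(value_ints));		/* value blob */
	memcpy(slot + tz_cd_off(value_ints), &cd, sizeof (cd));	/* cd */
	memcpy(slot + name_off, name, strlen(name) + 1);	/* name + NUL */
	return (0);
}
```

For a 2×uint64 value in a 64-byte chunk, this places the value at bytes 0..15, the cd at 16..19, and the name from byte 20 onward.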

Supported chunk sizes and resulting geometry (examples only).
name_len = chunk - stride - 4  (stride = mz_value_ints * 8)

chunk | stride | name_len | integers | use-case
------+--------+----------+----------+--------------------------------------
  64  |    8   |    52    |    1     | 1×uint64, name up to 51 chars
  64  |   16   |    44    |    2     | 2×uint64 (Lustre FID)
  64  |   24   |    36    |    3     | 3×uint64
  64  |   32   |    28    |    4     | 4×uint64
  64  |   56   |     4    |    7     | max stride for chunk=64
 128  |    8   |   116    |    1     | 1×uint64 + long name
 128  |   16   |   108    |    2     | 2×uint64 + long name
 128  |   48   |    76    |    6     | 6×uint64 (3×Lustre FID)
 128  |  120   |     4    |   15     | max stride for chunk=128
 256  |    8   |   244    |    1     | 1×uint64 + very long name
 256  |   16   |   236    |    2     | 2×uint64 + very long name
 256  |  128   |   124    |   16     | 16×uint64 (wide value, medium name)
 256  |  248   |     4    |   31     | max stride for chunk=256
 ...
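
The rows above all follow from the single formula name_len = chunk - stride - 4. The sketch below reproduces that arithmetic so individual rows can be checked. The function and macro names are hypothetical, and the minimum name length of 4 bytes is an assumption inferred from the "max stride" rows, not a value confirmed by the patch.

```c
#include <assert.h>

#define	TZ_CD_BYTES	4	/* uint32_t cd */
#define	TZ_MIN_NAME	4	/* assumed from the "max stride" rows */

/* Returns name_len in bytes (NUL included), or 0 for an invalid geometry. */
static inline unsigned
tz_name_len(unsigned chunk, unsigned value_ints)
{
	unsigned stride = value_ints * 8;

	if (chunk != 64 && chunk != 128 && chunk != 256)
		return (0);
	if (value_ints == 0 || stride + TZ_CD_BYTES + TZ_MIN_NAME > chunk)
		return (0);
	return (chunk - stride - TZ_CD_BYTES);
}

/* Widest value (in uint64s) that still leaves TZ_MIN_NAME bytes of name. */
static inline unsigned
tz_max_value_ints(unsigned chunk)
{
	assert(chunk >= TZ_CD_BYTES + TZ_MIN_NAME);
	return ((chunk - TZ_CD_BYTES - TZ_MIN_NAME) / 8);
}
```

Under these assumptions, `tz_name_len(64, 2)` gives 44 and `tz_max_value_ints(256)` gives 31, matching the table.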

Note: tzap_try_promote() skips stride=8 with chunk=64 because it provides only 2 more name bytes than MicroZAP, so chunk=128 is the minimum for stride=8. chunk=64 is used only when stride >= 16 (num_integers > 1).

Other details on ZAP upgrade conditions:

MicroZAP -> TinyZAP Conditions:
Promotion is attempted automatically on the first zap_add() when the entry fails the plain MicroZAP constraints. All of the following must hold:

integer_size == 8 (only uint64_t values supported)
stride = num_integers × 8 >= 8
At least one chunk size (64/128/256) can accommodate: stride + 4 + TZAP_MIN_NAME_LEN <= chunk && strlen(key) < TZAP_NAME_LEN(chunk, stride)
The pool feature com.hpe:tinyzap is enabled

The stride is stamped once on the first qualifying add and cannot change. The smallest fitting chunk is selected automatically.
For stride=8, promotion is also allowed on a populated MicroZAP: existing entries are re-encoded in-place via tzap_reencode_micro_to_tiny(), which re-packs the fixed 64-byte MicroZAP slots into the wider TinyZAP chunk format using a buffer.
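
The promotion conditions and the "smallest fitting chunk" rule can be sketched roughly as follows. This is not the code of tzap_try_promote(); the function name is hypothetical, TZAP_MIN_NAME_LEN is assumed to be 4, the feature-flag check is omitted, and the caller is assumed to invoke it only after the entry has failed the plain MicroZAP constraints.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define	TZ_MIN_NAME	4	/* assumed value of TZAP_MIN_NAME_LEN */

/*
 * Returns the chunk shift (6, 7 or 8) of the smallest chunk that fits the
 * entry, or 0 if no TinyZAP geometry fits and FatZAP must be used.
 */
static inline unsigned
tz_pick_chunk_shift(unsigned integer_size, unsigned num_integers,
    const char *key)
{
	size_t stride, keylen;

	assert(key != NULL);
	if (integer_size != 8 || num_integers == 0)
		return (0);	/* only uint64_t values are supported */

	stride = (size_t)num_integers * 8;
	keylen = strlen(key);

	for (unsigned shift = 6; shift <= 8; shift++) {
		size_t chunk = (size_t)1 << shift;

		/* stride=8 in a 64B chunk gains too little over MicroZAP */
		if (stride == 8 && shift == 6)
			continue;
		if (stride + 4 + TZ_MIN_NAME > chunk)
			continue;	/* value too wide for this chunk */
		if (keylen < chunk - stride - 4)
			return (shift);	/* key plus NUL fits */
	}
	return (0);
}
```

With these assumptions, a 2×uint64 entry with a short key selects shift 6 (64B), while a 1×uint64 entry with a 60-character key selects shift 7 (128B).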

TinyZAP Chunk Upgrade (in-place, stays TinyZAP)
When a new key is too long for TZAP_NAME_LEN(chunk, stride) but fits a larger chunk size, tzap_try_chunk_upgrade() re-packs all entries into the new chunk size without upgrading to FatZAP. The chunk can grow from 64->128 or 128->256. The block is grown if needed (up to zap_micro_max_size).
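
Because the stride does not change during a chunk upgrade, the value, cd, and name keep the same offsets inside each slot; only the room available for the name grows. A simplified sketch of such a re-pack is shown below. The function name is hypothetical, it copies every slot (including empty ones) without compacting, and it ignores block growth, so it should not be read as the behavior of tzap_try_chunk_upgrade().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Copy n slots of old_chunk bytes into a new zero-filled buffer with
 * new_chunk-byte slots. Returns the new buffer (caller frees), or NULL.
 */
static inline uint8_t *
tz_repack(const uint8_t *old, size_t n, size_t old_chunk, size_t new_chunk)
{
	uint8_t *buf;

	assert(new_chunk > old_chunk);
	buf = calloc(n, new_chunk);
	if (buf == NULL)
		return (NULL);

	/* Same stride, so each entry is a prefix of its wider slot. */
	for (size_t i = 0; i < n; i++)
		memcpy(buf + i * new_chunk, old + i * old_chunk, old_chunk);
	return (buf);
}
```

The zero fill from `calloc()` leaves the extra name bytes of each slot as NUL padding.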

TinyZAP -> FatZAP Conditions:
A FatZAP upgrade is forced when any of the following occur:

integer_size != 8
num_integers != stride / 8 (value width mismatch with stamped stride)
Key is too long for TZAP_NAME_LEN(256, stride) (no chunk fits)
All chunk slots are full and the block cannot grow further
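
Put together, the four conditions above amount to a predicate like the one sketched here. The function and parameter names are hypothetical; in particular, `block_full_and_maxed` stands in for whatever check the patch uses to decide that all slots are full and the block cannot grow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true when an add to a TinyZAP must upgrade it to FatZAP. */
static inline bool
tz_needs_fatzap(unsigned integer_size, unsigned num_integers,
    const char *key, unsigned stamped_value_ints, bool block_full_and_maxed)
{
	size_t stride = (size_t)stamped_value_ints * 8;

	assert(key != NULL && stamped_value_ints > 0);
	if (integer_size != 8)
		return (true);
	if (num_integers != stamped_value_ints)
		return (true);	/* width differs from the stamped stride */
	/* Equivalent of strlen(key) >= TZAP_NAME_LEN(256, stride). */
	if (stride + 4 >= 256 || strlen(key) >= 256 - stride - 4)
		return (true);	/* even the largest chunk is too small */
	return (block_full_and_maxed);
}
```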

During mzap_upgrade(), existing TinyZAP entries are re-encoded into FatZAP leaf blocks via tzap_upgrade_entries(). This function reads the original block size from the sz snapshot taken before fzap_upgrade() changes db_size to 16KB, preventing iteration over ghost slots.

Plain MicroZAP -> FatZAP (Unchanged)
If TinyZAP promotion fails (no fitting chunk, integer_size != 8, geometry mismatch), the existing MicroZAP -> FatZAP path is taken.

others:
SPA Feature Flag: com.hpe:tinyzap
A new pool feature, SPA_FEATURE_TINYZAP (com.hpe:tinyzap) is introduced:

  1. Not read-only compatible: pools with TinyZAP objects cannot be imported by software that does not support this feature, even read-only.
  2. Flags: ZFEATURE_FLAG_MOS. It was decided that MOS entries need not use TinyZAP.
  3. The feature becomes active the first time a TinyZAP object is created and returns to enabled when all TinyZAP objects have been removed or upgraded to FatZAP.

How Has This Been Tested?

Added simple tests:
A functional test suite that exercises MicroZAP->TinyZAP promotion, chunk upgrade, TinyZAP->FatZAP upgrade, remount, readdir, collisions, the feature flag, etc.

Before the patch using Lustre (FatZAP):

# mkdir testdir1 && touch testfile1
# du --si test*
100k    testdir1
1.1k    testfile1

Performance:

416 tasks, 1248000 files/directories
SUMMARY rate: (of 3 iterations) (op/sec)
   Operation                     Max            Min           Mean        Std Dev
   ---------                     ---            ---           ----        -------
   Directory creation          40183.030      33505.043      37718.225       3666.264
   Directory stat             150247.149     140618.623     145442.903       4814.295
   Directory removal           69091.318      55062.863      60385.987       7601.242

Total space taken by 1.25 million directories: Total Size: 117.7G (98.86 KB/Inode)

After this patch using Lustre (TinyZAP):

# mkdir testdir1 && touch testfile1
# du --si test*
1.1k    testdir1
1.1k    testfile1

Total space taken by 1.25 million directories: Total Size: 3.7G (3.09 KB/Inode)

Performance:

416 tasks, 1248000 files/directories
SUMMARY rate: (of 3 iterations) (op/sec)
   Operation                     Max            Min           Mean        Std Dev
   ---------                     ---            ---           ----        -------
   Directory creation         106757.610     100377.356     104069.288       3306.408
   Directory stat             296205.362     225769.238     262180.060      35278.604
   Directory removal          155819.816     134755.537     145367.644      10533.050

Overall summary of the results:
For draid2:9d:12c:1s-0 (flash MDT and 4 OSTs):
Directory creation improved by +176% - over 2.75x faster, exceeding 100K ops/sec
Directory removal improved by +141% - over 2.4x faster, exceeding 145K ops/sec
Directory stat improved by +97% at peak - nearly 2× faster, approaching 300K ops/sec

Space Efficiency:
Almost 99% less space used by an empty directory (100k -> 1.1k per du).
TinyZAP (1-2 KB) vs. FatZAP (100-130 KB).
For 1.25 million directories: TinyZAP 3.7G (3.09 KB/Inode) vs. FatZAP 117.7G (98.86 KB/Inode), a ~32x reduction.

TODO:

  1. Fix the checkstyle and other rebase-related issues.
  2. I have some local bash tests, some of which I will probably add to zfs-tests in the next push.
  3. I am running a few tests with Lustre, which may delay things a little.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@akashb-22
Contributor Author

@behlendorf @robn ^^ Please let me know your thoughts on this.

@robn robn self-requested a review May 20, 2026 22:26
@behlendorf behlendorf added the Status: Design Review Needed Architecture or design is under discussion label May 20, 2026
@akashb-22 akashb-22 force-pushed the tinyzap_blob2 branch 2 times, most recently from e87779e to ccf54f6 Compare May 21, 2026 14:36
Contributor

@behlendorf behlendorf left a comment

This is shaping up nicely!

Comment thread module/zfs/zap.c
Comment thread module/zfs/zap.c Outdated
Comment thread module/zfs/zap_micro.c Outdated
Comment thread module/zcommon/zfeature_common.c Outdated
Comment thread module/zfs/zap_tiny.c
#
# Copyright (c) 2013, 2014 by Delphix. All rights reserved.
# Copyright 2016 Nexenta Systems, Inc. All rights reserved.
# Copyright (c) 2026, Hewlett Packard Enterprise Development LP.
Contributor

Rob's proposed unit test framework in #18564 would be an ideal way to exercise the new TinyZAP code. In addition to basic unit tests (add/remove/lookup) we can verify the various promotion paths behave as intended (MicroZAP -> TinyZAP -> FatZAP, etc).

Member

Since the full suite isn't upstreamed yet, I'm intending to wire what I have up to this PR and see what falls out. I'll share the results soon.

But also, if this PR lands before the test suite does, I'll be sure to include coverage in the test suite. You get grandfathered in 👴

Contributor Author

Cool, I'll check it out.
I think I'll gradually get the reviews and the required changes for this patch, and probably let's see how things go later.

Comment thread include/sys/zap_impl.h Outdated
Member

@robn robn left a comment

I'm totally on-board with the idea, but this all seems very convoluted to me.

If I'm understanding all this correctly, is a MicroZAP effectively the same as a TinyZAP with chunk=6 (64B) and stride=1 (8B)?

If so, I'd suggest the code would be a lot nicer by actually making the entire implementation be about TinyZAPs (by structure if not by name), and just special-casing MicroZAPs: if we don't see MZAP_FLAG_TINY, then use chunk=6, stride=1 and do the extra MZAP_NAME_LEN check in the add and upgrade paths.

If you fold all those checks and math into a small number of macros or inline functions (which you basically already have), then it seems like this PR should be almost entirely a mechanical conversion, plus the feature flag handling code.

Because of this, my review comments are either small style nits, or design queries that I think would apply regardless of the structure. Whichever way it goes, I'll need another review round.

Comment thread include/sys/zap_impl.h Outdated
Comment thread include/sys/zap_impl.h
Comment thread module/zcommon/zfeature_common.c Outdated
Comment thread module/zcommon/zfeature_common.c Outdated
Comment thread module/zfs/zap.c Outdated
Comment thread module/zfs/zap_micro.c Outdated
Comment thread module/zfs/zap_tiny.c Outdated
Comment thread include/sys/zap_impl.h Outdated
Comment thread module/zfs/zap_impl.c
Comment thread include/sys/zap_impl.h Outdated
@akashb-22
Contributor Author

Changes in my latest push:

  1. Support populated microzap to tinyzap upgrade and chunk upgrade (64->128->256) for tinyzap.
  2. Change the implementation of TinyZAPs (by structure). Three independent uint8_t fields (mz_flags(MZAP_FLAG_TINY), mz_chunk_shift (log2(chunk): 6=64B, 7=128B, 8=256B), mz_value_ints (stride / 8))
  3. Fix mzap_normalization_conflict to handle TinyZAP entries.
  4. Added ZFEATURE_FLAG_MOS to disable TinyZAP on the MOS.
  5. Removed tzap_should_promote entirely; everything is now handled in tzap_try_promote.
  6. A flexible array member gives a compiler error ("flexible array member in a struct with no named members"). Changed to tze_data[0] /* zero length array */
  7. Added basic zap testcases covering a few scenarios. (TinyZAP upgrade paths and entries, etc.)
  8. Updated the supported chunk sizes and resulting geometry comments.
  9. Added TZAP_VERIFY_PHYS(__FUNCTION__) for debug on-disk validation
  10. Other review comments and fixes.

Things to be discussed:

Flexible array member error. Changed it to [0]. (With -fsanitize=bounds in debug builds, [0] seems to be the correct choice.)
 zfs/include/sys/zap_impl.h:200:17: error: flexible array member in a struct with no named members
  200 |         uint8_t tze_data[]; /* variable size */
      |                 ^~~~~~~~

@akashb-22 akashb-22 force-pushed the tinyzap_blob2 branch 2 times, most recently from 8147b1a to 5ef112e Compare May 26, 2026 17:41
@amotin
Member

amotin commented May 26, 2026

Sorry if already mentioned, but I suppose this feature will be not only read-incompatible, but also send/receive-incompatible with older receivers. While I was also thinking about some more efficient ZAP formats for BRT/DDT purposes, a read-incompatible feature means we need to update boot loaders for all OSes and add some more feature flags to replication streams.

@akashb-22 akashb-22 force-pushed the tinyzap_blob2 branch 3 times, most recently from b775843 to c8fd312 Compare June 1, 2026 03:57
MicroZAP is limited to 1×uint64 values and 49-char keys; any wider
entry forces a full FatZAP upgrade.  TinyZAP avoids this for the
common case of multi-integer values (e.g. Lustre FIDs) and long keys.

Introduce TinyZAP, a MicroZAP variant that reuses mzap_phys_t,
repurposing the padding bytes after mz_normflags as three independent
uint8_t fields:

  mz_flags        bit 0 = MZAP_FLAG_TINY
  mz_chunk_shift  log2(chunk): 6=64B, 7=128B, 8=256B
  mz_value_ints   stride / 8  (number of uint64 values per entry)

Geometry is stamped automatically on the first zap_add() based on
observed entry shape; no create-time hint is required.  Subsequent
adds must match the stamped geometry or a FatZAP upgrade is triggered.

All ZAP operations (add, update, remove, lookup, cursor, byteswap,
upgrade to FatZAP) dispatch to TinyZAP paths when zap_stride != 0.

Signed-off-by: Akash B <akash-b@hpe.com>
@akashb-22
Contributor Author

Updated the PR description, added new tests, and fixed the chunk upgrade paths and TinyZAP by struct comments. Another round of reviews?

@akashb-22 akashb-22 requested review from behlendorf and robn June 1, 2026 11:23
@behlendorf
Contributor

@adilger @tim-day-387 it'd be great if you could take a look at this. I want to make sure we really understand and address Lustre's current ZAP needs and if possible anything related you might be thinking about longer term!

@adilger
Contributor

adilger commented Jun 1, 2026

@behlendorf, for future expansion usage by Lustre Metadata Redundancy we have recently expanded the ldiskfs "dirdata" feature to allow storing multiple 16-byte FIDs into a single directory entry to reference multiple inode mirrors, similar to how ZFS dnodes can reference up to 3 block pointers.

I can't comment on the details of the implementation, but from the commit message comments it appears that this TinyZap implementation will allow this to work for ZFS as well. There may be some transition time where pre-existing directory ZAPs are not able to create new entries with multiple FIDs since the TinyZAP geometry is fixed by the first entry created in it, but that should be a relatively uncommon configuration.

